Introduction To Statistical Inference

Gwendolyn Eadie, Assistant Professor
November 4, 2019

Department of Astronomy & Astrophysics / Department of Statistical Sciences

University of Toronto

By the end of the day, you will...

  • understand what a random variable is
  • be able to explain the difference between joint, conditional, and marginal probability
  • understand what a statistical model is
  • recognize Bayes' Theorem, and understand how to use it in practice
  • know how to plot and write simple scripts in R, and how to use Rstudio
  • be enthusiastic to learn more about statistics!

Probability


Sample Space

  • The set of all possible outcomes of an experiment.
    • What would constitute an experiment in astronomy?
  • \( \omega \) is a sample outcome or realization.
    • Can you think of a sample outcome in your example from above?
  • \( \Omega \) is the set of all possible realizations.
    • In your example, what is \( \Omega \)?
  • Subsets of \( \Omega \) are called events (we call each event \( A \))
    • What might you consider an event?

Sample Space

Example: We obtain some data from a star and estimate its temperature.

\( \omega \) is a measurement of the temperature of a star. The event \( A \) that the temperature is higher than 7500K but not higher than 10000K is A = (7500,10000].

Probability (axiomatic)

The probability here is purely mathematical, based on three axioms (definitions):

  1. \( \mathrm{Pr}(A) \geq 0 \) for every \( A \), where \( A \) is an event.

  2. \( \mathrm{Pr}(\Omega) = 1 \).

  3. If \( A_1, A_2, \ldots \) are disjoint (don't contain any common outcomes), then \[ \mathrm{Pr}( A_1 \; \mathrm{or} \; A_2 \; \mathrm{or} \; \ldots ) = \mathrm{Pr}(A_1) + \mathrm{Pr}(A_2) + \ldots \]

What is the probability of \( A_1 \) and \( A_2 \) if they are disjoint?

\[ \mathrm{Pr}(A_1 \; \mathrm{and} \; A_2 \; \mathrm{and} \; \ldots) = 0 \]

More Probability

  • It follows that, for two events \( A \) and \( B \):
    • \( \mathrm{Pr}(\emptyset) = 0 \)
    • If \( A \) is contained in \( B \), \( \;\mathrm{Pr}(A) \leq \mathrm{Pr}(B) \)
    • \( 0 \leq \mathrm{Pr}(A) \leq 1 \)
    • \( \mathrm{Pr} ( \mathrm{not}\; A) = 1 - \mathrm{Pr}(A) \)
    • \( \mathrm{Pr}( A \; \mathrm{or} \; B ) = \mathrm{Pr}(A) + \mathrm{Pr}(B) - \mathrm{Pr}( A \; \mathrm{and} \; B ) \)
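These rules are easy to verify on a finite sample space. A quick R sketch using a fair six-sided die (the events are arbitrary choices for illustration):

```r
# sample space of a fair six-sided die
omega <- 1:6
pr <- function(E) length(E) / length(omega)   # equally likely outcomes
A <- c(1, 2); B <- c(2, 3)

# inclusion-exclusion: Pr(A or B) = Pr(A) + Pr(B) - Pr(A and B)
lhs <- pr(union(A, B))
rhs <- pr(A) + pr(B) - pr(intersect(A, B))
isTRUE(all.equal(lhs, rhs))   # TRUE

# complement rule: Pr(not A) = 1 - Pr(A)
isTRUE(all.equal(pr(setdiff(omega, A)), 1 - pr(A)))   # TRUE
```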

Independence

Two events \( A \) and \( B \) are independent if \( \mathrm{Pr}(A \; \mathrm{and} \; B) = \mathrm{Pr}(A)\mathrm{Pr}(B) \)

  • With your neighbour, draw a Venn Diagram of events \( A \) and \( B \) (3 minutes)
  • Suppose that \( A \) and \( B \) are disjoint events, each with positive probability. Can they be independent?
    • No
  • Why??
    • If \( A \) and \( B \) are independent, then \( \quad \mathrm{Pr}(A)\mathrm{Pr}(B) > 0 \; \) but if they are disjoint, then \( \; \mathrm{Pr}(A\; \mathrm{and} \; B) = \mathrm{Pr}(\emptyset) = 0 \).

Conditional Probability

  • If \( \mathrm{Pr}(B) > 0 \) then the conditional probability of \( A \) given \( B \) is written \[ \mathrm{Pr}(A|B) = \mathrm{Pr}(A \; \mathrm{and} \; B) / \mathrm{Pr}(B) \]
  • In general, \( \mathrm{Pr}(A|B) \neq \mathrm{Pr}(B|A) \)
  • If \( A \) and \( B \) are independent events, then \( \mathrm{Pr}(A|B) = \mathrm{Pr}(A) \). The reverse is also true!
  • Another useful form: \( \mathrm{Pr}(A \; \mathrm{and} \; B) = \mathrm{Pr}(B|A)\mathrm{Pr}(A) = \mathrm{Pr}(A|B)\mathrm{Pr}(B) \)
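A small sanity check of these identities, again with the die (here \( A \) is "even roll" and \( B \) is "roll above 3", chosen for illustration):

```r
omega <- 1:6
pr <- function(E) length(E) / length(omega)
A <- c(2, 4, 6)   # even roll
B <- c(4, 5, 6)   # roll greater than 3

pr_AB <- pr(intersect(A, B))      # Pr(A and B) = 2/6
pr_A_given_B <- pr_AB / pr(B)     # (2/6)/(3/6) = 2/3
pr_B_given_A <- pr_AB / pr(A)     # also 2/3 here, but not equal in general
```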

Law of Total Probability

If \( A_1, A_2, \ldots , A_k \) are a partition of \( \Omega \), then for any event \( B \):

\[ \mathrm{Pr}(B) = \sum_i \mathrm{Pr}(B|A_i)\mathrm{Pr}(A_i) \]

Bayes' Theorem

Suppose \( A_1, A_2, \ldots , A_k \) are a partition of \( \Omega \), and \( \mathrm{Pr}(A_i) > 0 \) for each \( i \).

If \( \;\mathrm{Pr}(B) > 0 \), then for each \( \; i = 1, \ldots , k \):

\[ \mathrm{Pr}( A_i | B ) = \frac{ \mathrm{Pr}( B | A_i )\mathrm{Pr}( A_i )}{\sum_j \mathrm{Pr}( B | A_j )\mathrm{Pr}( A_j )} \]

\( \mathrm{Pr}(A_i) \) is the prior probability of \( A \), and \( \mathrm{Pr}(A_i|B) \) is the posterior probability of \( A \). Often you'll see Bayes' theorem written more simply, like this: \[ \mathrm{Pr}(A|B) = \frac{\mathrm{Pr}(B|A)\mathrm{Pr}(A)}{\mathrm{Pr}(B)} \]

Toy exercise using Bayes' Theorem - Question (7 minutes)

In a certain galaxy, 60% of the stellar systems have an Earth-like planet, and all of these systems also have a Jupiter-like planet. However, only half of the systems without an Earth-like planet have a Jupiter-like planet.

What's the probability that a stellar system with a Jupiter-like planet also has an Earth-like planet?

Toy exercise using Bayes' Theorem - Hint

What's the probability that a stellar system with a Jupiter-like planet also has an Earth-like planet?

  • mathematically, we're being asked to find \( \mathrm{Pr}(\text{Earth-like}|\text{Jupiter-like}) \)

\[ = \frac{\mathrm{Pr}(\text{Jupiter-like}|\text{Earth-like})\mathrm{Pr}(\text{Earth-like})}{\mathrm{Pr}(\text{Jupiter-like})} \]

  • let's make this easier to look at

\[ = \frac{\mathrm{Pr}(J|E)\mathrm{Pr}(E)}{\mathrm{Pr}(J)} \]

Toy exercise using Bayes' Theorem - Solution

We are told that \( \mathrm{Pr}(E)=0.6 \) and that \( \mathrm{Pr}(J|E)=1 \)

We are also told, somewhat indirectly, that \( \mathrm{Pr}(J)=0.8 \). This follows from \[ \mathrm{Pr}(J) = \mathrm{Pr}(J|E)\mathrm{Pr}(E) + \mathrm{Pr}(J|\text{not E})\mathrm{Pr}(\text{not E}) \] \[ \mathrm{Pr}(J) = (1)(0.6) + (0.5)(0.4) = 0.6 + 0.2 = 0.8 \]

Now you can solve Bayes' theorem to find \[ \mathrm{Pr}(E|J) = \frac{P(J|E)P(E)}{P(J)} = \frac{(1)(0.6)}{0.8} = 0.75 \]
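The arithmetic is easy to check in R:

```r
pr_E <- 0.6              # Pr(Earth-like)
pr_J_given_E <- 1        # every Earth-like system also has a Jupiter-like planet
pr_J_given_notE <- 0.5   # half of the remaining systems do

# law of total probability:
pr_J <- pr_J_given_E * pr_E + pr_J_given_notE * (1 - pr_E)   # 0.8

# Bayes' theorem:
pr_E_given_J <- pr_J_given_E * pr_E / pr_J                   # 0.75
```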

Bayes' Theorem --- How do you use it for data analysis?

  • in reality, we aren't told that 60% of the stars in a galaxy host an Earth-like planet!
  • we must assign a model describing the probability and then estimate the model's parameters using data and any prior information we have
  • more on this in the afternoon!
  • … and for that we need to know more about probability distributions

Beyond the Gaussian distribution...

What other distributions have you heard of?

Relationships between univariate distributions

How do I know what to use when I don't know all the distributions?

Think about the sampling process, and understand what kind of random variable you are dealing with.


Random Variables


Random Variables

  • Random variables can be used to link sample spaces and events to data.
  • A random variable \( X(\omega) \) assigns a real number to each outcome \( \;\omega \). Note that \( X \) is a function of \( \omega \).
  • Example: Suppose that we have an instrument that classifies a star as either “blue” \( B \) or “not blue” \( N \).
    • We define a random variable \( X \) to be the number of blue stars measured in a sample of ten stars.
    • If we measure ten stars using this instrument, we might get \[ \omega = BBBNNBBBBN \]
    • What is \( X(\omega) \)?
    • Then \( X(\omega) = 7 \).
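A minimal sketch of this in R:

```r
# the realization omega as a character vector
omega <- strsplit("BBBNNBBBBN", "")[[1]]

# the random variable X: the number of blue stars in the sample
X <- sum(omega == "B")
X   # 7
```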

Cumulative Distribution Function

The cumulative distribution function (CDF) is \( F_X(x) = \mathrm{Pr}( X \leq x ) \)

In other words, it gives the probability of a random variable \( X \) being less than or equal to the value \( x \).

  • A CDF has three properties:
    1. Non-decreasing: \( x_1 < x_2 \) implies that \( F(x_1) \leq F(x_2) \)
    2. Normalized: \( F(x) \) goes from \( 0 \) to \( 1 \) as \( x \) goes from \( -\infty \) to \( +\infty \)
    3. Right continuous

Example: Cumulative Distribution Function for Standard Normal Distribution
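In R the standard-normal CDF is pnorm; a quick sketch:

```r
# plot the CDF of the standard normal
x <- seq(-4, 4, length.out = 201)
plot(x, pnorm(x), type = "l", main = "CDF of the standard normal")

pnorm(0)      # 0.5: half the probability lies below the mean
pnorm(1.96)   # about 0.975
```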

Probability Mass Function

The probability mass function (PMF) for \( X \) is \( f_X(x) = \mathrm{Pr}( X = x ) \).

It exists only for \( X \) that takes countably many discrete values.

Example: \( X \) could be defined as the number of stars measured until we find 22 blue ones with a mass less than 1 solar mass. Note that we could conceivably count an infinite number of stars before we ever get 22 blue ones less than 1 solar mass (if the universe contains infinitely many stars and we could measure them all!).
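This star-counting variable has a negative binomial distribution. A sketch in R, with an assumed per-star probability p = 0.1 of being blue and under 1 solar mass (p is not given in the example):

```r
p <- 0.1   # assumed probability that a star qualifies (illustrative value)
r <- 22    # number of qualifying stars we wait for

# dnbinom() gives the PMF of the number of NON-qualifying stars seen
# before the r-th qualifying one; the total count is that number plus r.
k <- 0:1000
pmf <- dnbinom(k, size = r, prob = p)
sum(pmf)   # close to 1 (the tail beyond k = 1000 is tiny)
```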

Probability Density Function

The probability density function (PDF) is a function \( f_X(x) \geq 0 \) for all \( x \), which integrates to \( 1 \), and for every \( a \leq b \)

\[ \mathrm{Pr}(a < X < b) = \int_a^b f_X(x) dx \]

If such a function exists, then \( X \) is a continuous random variable with CDF

\[ F_X(x) = \int_{-\infty}^x f_X(t) dt \]

and \( f_X(x) \) is the derivative of \( F_X(x) \) at all points where \( F_X \) is differentiable.
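A quick numerical check of these relationships with the standard normal PDF:

```r
f <- function(x) dnorm(x)          # standard normal PDF
integrate(f, -Inf, Inf)$value      # 1: the density integrates to one
integrate(f, -1, 1)$value          # about 0.683
pnorm(1) - pnorm(-1)               # the same probability via the CDF
```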

Quantile Function

  • The inverse CDF is also known as the quantile function: \( F_X^{-1}(q) = \) the smallest value of \( x \) such that \( F_X(x) \geq q \), for \( q \) between \( 0 \) and \( 1 \).
  • If \( Q \) is a \( \mathrm{Uniform}(0,1) \) random variable, i.e. \( F_Q(q) = q \) for \( 0 \leq q \leq 1 \), then the random variable \( X = F_X^{-1}(Q) \) has CDF \( F_X \).
    • This can be used to generate random draws from a distribution \( F_X \) using \( \mathrm{Uniform}(0,1) \) random number generators.
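A sketch of this inverse-CDF sampling trick for the standard normal:

```r
set.seed(1)
q <- runif(1e5)     # Uniform(0,1) draws
x <- qnorm(q)       # push them through the inverse CDF of N(0,1)
c(mean(x), sd(x))   # approximately 0 and 1, as expected for N(0,1)
```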

Example: Quantile Function for Standard Normal

Joint, Conditional, or Marginal Distributions?

The terms joint distribution, conditional distribution, and marginal distribution get thrown around a lot.

What's the difference?

Joint Distribution

The joint distribution is the multidimensional probability distribution of two or more random variables.

If X and Y have a joint distribution, then we can write the PMF or PDF as \( f_{X,Y} \). Let's look at an example…

# draw 5,000 samples from a bivariate normal distribution
library(MASS)  # provides mvrnorm
samples = mvrnorm(n = 5000, mu = c(1, 1),
                  Sigma = matrix(data = c(0.3, 0.1,
                                          0.1, 0.5),  # a covariance matrix must be symmetric
                                 nrow = 2, ncol = 2, byrow = TRUE))
# name the columns
colnames(samples) = c("theta1", "theta2")

# make samples a data frame
samples = as.data.frame(samples)

Example: Joint Distribution


Conditional Distribution

The conditional distribution is the distribution of a parameter given a specific value of the other parameter(s).

  • For discrete random variables, the conditional PMF is \[ f_{X|Y}(x|y) = \mathrm{Pr}( X=x | Y=y )\\ = \frac{\mathrm{Pr}( X=x, Y=y )}{\mathrm{Pr}( Y=y )}\\ = \frac{f_{X,Y}(x,y)}{f_Y(y)}\; \mathrm{if}\; f_Y(y) > 0 \]
  • For continuous random variables, the conditional PDF is \[ f_{X|Y}(x|y) = \frac{f_{X,Y}(x,y)}{f_Y(y)}\; \mathrm{if} \; f_Y(y) > 0 \]

Example: Conditional Distribution

For example, in the previous plot, we could look at the conditional distribution of \( \theta_2 \) given that \( \theta_1=1.1 \). We denote this as \[ p(\theta_2 | (\theta_1=1.1) ). \]
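We can approximate this conditional distribution by keeping only the samples in a narrow slice around \( \theta_1 = 1.1 \). A sketch, regenerating the joint samples from the earlier example (the covariance values are the same assumed ones):

```r
library(MASS)   # for mvrnorm
set.seed(42)
Sigma <- matrix(c(0.3, 0.1,
                  0.1, 0.5), nrow = 2, byrow = TRUE)
samples <- as.data.frame(mvrnorm(n = 5e4, mu = c(1, 1), Sigma = Sigma))
names(samples) <- c("theta1", "theta2")

# keep only draws with theta1 near 1.1
slice <- subset(samples, abs(theta1 - 1.1) < 0.05)
mean(slice$theta2)

# for a bivariate normal the conditional mean is known in closed form:
1 + Sigma[1, 2] / Sigma[1, 1] * (1.1 - 1)   # about 1.033
```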


Example: Conditional Distribution (cont'd)

NOTE: The conditional distribution will look different for different values of \( \theta_1 \). For example, the conditional distribution of \( \theta_2 \) given that \( \theta_1=0.2 \) looks like the green curve on the right: \[ p(\theta_2 | (\theta_1=0.2) ). \]


Marginal Distributions

The marginal distribution is the distribution of a parameter regardless of the values of the other parameter(s).

  • If X and Y have a joint distribution with PMF \( f_{X,Y} \), then the marginal mass function for \( X \) is \[ f_X(x) = \mathrm{Pr}( X=x ) = \sum_y \mathrm{Pr}( X=x, Y=y )\\ = \sum_y f_{X,Y}(x,y) \]
  • If X and Y have a joint distribution with PDF \( f_{X,Y} \), then the marginal density function for \( X \) is \[ f_X(x) = \int f_{X,Y}(x,y) dy \]
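With samples from a joint distribution, marginalizing amounts to simply ignoring the other column. A sketch with the bivariate normal from earlier (same assumed covariance):

```r
library(MASS)
set.seed(7)
samples <- mvrnorm(n = 5e4, mu = c(1, 1),
                   Sigma = matrix(c(0.3, 0.1,
                                    0.1, 0.5), nrow = 2))
theta1 <- samples[, 1]         # the marginal sample for theta1
c(mean(theta1), var(theta1))   # approximately 1 and 0.3, matching mu and Sigma
```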

Example: Marginal Distribution


Independence of Random Variables

Suppose \( X \) and \( Y \) have joint PDF \( f_{X,Y} \).

Then \( X \) and \( Y \) are independent random variables if (and only if)

\[ f_{X,Y}(x,y) = f_X(x)f_Y(y) \] for all values of \( x \) and \( y \).

In other words, the independence we talked about earlier (\( \mathrm{Pr}(A \; \mathrm{and} \; B)=\mathrm{Pr}(A)\mathrm{Pr}(B) \)) extends to probability density functions.

Expectation


Expected Value of a Random Variable

Expected value, or mean, or first moment: a one-number summary of a distribution.

  • For a random variable \( X \), \[ E\left[X \right]=\int xdF(x)=\begin{cases} \sum_{x}xf(x) & X\;\mathrm{discrete}\\ \int xf(x)dx & X\;\mathrm{continuous} \end{cases} \]
  • You might see \( E\left[ X \right] \) or \( \mu_X \) or maybe just \( \mu \).
  • It might also bring back memories of the “expectation value” in quantum mechanics \( \langle x \rangle \).

Properties of Expected Values

If \( X_1, X_2, ..., X_n \) are random variables and \( a_1, a_2, ..., a_n \) are constants, then \[ E\left[ \sum_i a_i X_i \right] = \sum_i a_i E\left[ X_i \right] \]

If the random variables are independent, then \[ E\left[ \prod_i X_i \right] = \prod_i E\left[ X_i \right] \]
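A Monte Carlo sketch of both properties, using two independent draws with known means:

```r
set.seed(3)
X1 <- rnorm(1e5, mean = 2)   # E[X1] = 2
X2 <- rexp(1e5, rate = 1)    # E[X2] = 1

mean(3 * X1 + 5 * X2)        # about 3(2) + 5(1) = 11 (linearity)
mean(X1 * X2)                # about 2, since X1 and X2 are independent
```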

Variance

Variance summarises the spread of a distribution.

If \( X \) is a random variable with mean \( \mu \), the variance of \( X \) is \[ \mathrm{Var}(X) = E\left[ (X - \mu)^2 \right] = \int (x-\mu)^2 dF(x) \]

(we use \( dF(x) \) to symbolize that this could be for discrete or continuous random variables)

The variance is often denoted by \( \sigma_X^2 \) or \( \sigma^2 \). The standard deviation is \( \sqrt{\mathrm{Var}(X)} = \sigma_X \).

Variance Properties

  • \( \mathrm{Var}(X) = E\left[X^2 \right] - \mu^2 \)
  • If \( a \) and \( b \) are constants, then \( \mathrm{Var}(aX+b) = a^2\mathrm{Var}(X) \)
  • If \( X_1, X_2, ..., X_n \) are independent random variables and \( a_1, a_2, ..., a_n \) are constants, then \[ \mathrm{Var}\left( \sum_i a_i X_i \right) = \sum_i a_i^2 \mathrm{Var}\left( X_i \right) \]

Sample Mean and Variance

If \( X_1, X_2, ..., X_n \) are independent random variables from the same distribution (an iid sample), then the sample mean is \[ \bar{X} = \frac{1}{n}\sum_{i=1}^n X_i \]

and the sample variance is \[ S^2 = \frac{1}{n-1}\sum_{i=1}^n(X_i-\bar{X})^2 \]

Now we're getting somewhere! \[ E[\bar{X}]=\mu\qquad\mathrm{Var}(\bar{X})=\frac{\sigma^{2}}{n}\qquad E[S^{2}]=\sigma^{2} \]
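A simulation sketch of these facts, using \( N(0, \sigma^2 = 4) \) samples of size \( n = 25 \):

```r
set.seed(11)
n <- 25; sigma <- 2

# 20,000 replicate sample means
xbars <- replicate(2e4, mean(rnorm(n, mean = 0, sd = sigma)))
mean(xbars)   # about 0 = mu
var(xbars)    # about sigma^2 / n = 0.16

# 20,000 replicate sample variances
S2 <- replicate(2e4, var(rnorm(n, mean = 0, sd = sigma)))
mean(S2)      # about sigma^2 = 4
```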

Covariance

If \( X \) and \( Y \) are random variables with means \( \mu_X \) and \( \mu_Y \), and standard deviations \( \sigma_X \) and \( \sigma_Y \), then the covariance between \( X \) and \( Y \) is

\[ \mathrm{Cov}(X,Y) = E\left[ (X-\mu_X)(Y-\mu_Y) \right] \]

The correlation between \( X \) and \( Y \) is

\[ \rho_{X,Y} = \frac{\mathrm{Cov}(X,Y)}{\sigma_X \sigma_Y} \]
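R computes both directly; a sketch with an assumed covariance matrix:

```r
library(MASS)
set.seed(5)
Sigma <- matrix(c(1.0, 0.6,
                  0.6, 2.0), nrow = 2)   # Var(X) = 1, Var(Y) = 2, Cov = 0.6
xy <- mvrnorm(n = 1e5, mu = c(0, 0), Sigma = Sigma)

cov(xy[, 1], xy[, 2])   # about 0.6
cor(xy[, 1], xy[, 2])   # about 0.6 / sqrt(1 * 2), roughly 0.42
```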

Covariance Properties

  • \( \mathrm{Cov}(X,Y) = E[XY] - E[X]E[Y] \)
  • If \( X \) and \( Y \) are independent, then \( \mathrm{Cov}(X,Y) = 0 \).
    • The converse is not generally true!
  • \( \mathrm{Var}(aX+bY) = \\a^2\mathrm{Var}(X) + b^2\mathrm{Var}(Y) + 2ab\mathrm{Cov}(X,Y) \)

Models and Statistical Inference


Statistical Inference

(known as learning in computer science…)

Question: Given data \( x_1, x_2, ..., x_n \), obtained by drawing from some unknown distribution \( F \), how do we infer \( F \)?

  • We might not actually care to know everything about \( F \). For instance, maybe we only want to know about the mean of \( F \).

Models, Parametric and Nonparametric

  • A statistical model is a set of distributions (or densities) that could be used to describe \( X_1, X_2, ..., X_n \).
  • A parametric model is a statistical model that can be parameterized by a finite number of parameters.
    • For example, if we assume that the data come from a Poisson distribution, then the model is the set of all Poisson distributions with \( \lambda > 0 \).
  • A nonparametric model is a statistical model that can't be parameterized by a finite number of parameters.
    • For example, we might consider every possible \( F \).

Statistical Inference Examples

  • Assume that \( X_1, X_2, ..., X_n \) are independent and identically distributed as \( N(\mu,\sigma^2) \). We are interested in estimating the parameter \( \mu \). We call \( \mu \) the parameter of interest. The remaining parameter \( \sigma^2 \) is called a nuisance parameter.
  • Suppose that \( X_1, X_2, ..., X_n \sim f_X \), and that we are interested in estimating \( \mu = E[X_1] \), assuming that \( \mu \) exists. We can think of \( \mu \) here as a function of \( F_X \).
  • Question: Are each of these examples parametric or nonparametric?

Point Estimation

  • A point estimate is our single best guess at some unknown quantity.
  • We denote a point estimate of \( \theta \) by \( \hat{\theta} \).
    • The quantity \( \theta \) is fixed but unknown.
    • The estimate \( \hat{\theta} \) depends on the data and is a random variable.
    • The distribution of \( \hat{\theta} \) is called the sampling distribution.
    • The standard deviation of \( \hat{\theta} \) is called the standard error.

Properties of Point Estimates

  • The bias of an estimator is: \( \mathrm{bias}(\hat{\theta}) = E[\hat{\theta}] - \theta \)
    • Unbiasedness was once considered essential in an estimator, but modern practice often accepts a little bias in exchange for lower variance.
  • The mean squared error or MSE of an estimator is: \[ \mathrm{MSE}(\hat{\theta}) = E[(\hat{\theta}-\theta)^2] = \left( \mathrm{bias}(\hat{\theta}) \right)^2 + \mathrm{Var}(\hat{\theta}) \]
    • Estimators with small MSEs are good: They will tend to result in estimates that are close to the true but unknown quantity of interest.

Confidence Sets

  • A \( 1-\alpha \) confidence interval for a parameter \( \theta \) is an interval \( C = (a,b) \) where \( a \) and \( b \) are functions of the data chosen so that \[ \mathrm{Pr}(\theta \in C) \geq 1-\alpha \] for all possible values of \( \theta \).
  • Note: It is the interval \( C \) that is random. The true \( \theta \) is fixed! A confidence interval is NOT a probability statement about \( \theta \)!
  • It's normal to be confused at this point. So, let's look at an example

Toy Example: Confidence Interval

An astronomer wanted to know the proportion of stars that are in a binary system. Let's pretend that in nature the true proportion of stars in binary systems is 80%, but the astronomer could only observe a sample of 70 systems.

The astronomer's data are binomial; either a star is in a binary system or it is not. They report a proportion of stars in binary systems of \( 0.771 \) with a 95% confidence interval of \( (0.673, 0.870) \).

What does this interval mean? Discuss with a neighbour.

Confidence Interval Applet http://www.rossmanchance.com/applets/ConfSim.html
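A simulation sketch of what "confidence" means here, using the binomial example (true p = 0.8, n = 70) and a normal-approximation interval:

```r
set.seed(2019)
p <- 0.8; n <- 70

# repeatedly: draw data, build a 95% interval, check whether it covers p
covered <- replicate(1e4, {
  phat <- mean(rbinom(n, size = 1, prob = p))   # sample proportion
  se <- sqrt(phat * (1 - phat) / n)             # its estimated standard error
  p >= phat - 1.96 * se && p <= phat + 1.96 * se
})

# the fraction of intervals that contain the true p:
mean(covered)   # close to the nominal 0.95
```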

Hypothesis Testing

Hypothesis testing assumes a default theory (null hypothesis) and devises a test to examine whether there is sufficient evidence in the data to reject the default theory.

  • With reference to a parameter \( \theta \), the null hypothesis might look something like: \( H_0: \theta = \theta_0 \)
  • An alternative hypothesis might be: \( H_1: \theta \neq \theta_0 \)
  • We construct a test statistic \( T(X) \), often based on the sampling distribution and \( \theta_0 \), and examine whether its value is consistent with \( H_0 \).
    • If not, then \( H_0 \) is rejected in favour of \( H_1 \).

Bayesian Inference


The Bayesian Method

Our choice of parametric model provides a framework in which to perform statistical inference. Previously, we considered the parameter \( \theta \) to be fixed but unknown.

Under the Bayesian inference paradigm, we instead take \( \theta \) to be random, and consider a distribution for \( \theta \) that represents our belief about its value before and after observing data.

The Bayesian Method in 3 Steps

  1. Specify a prior distribution \( f(\theta) \) for the model parameter \( \theta \) that reflects our belief about its value.
  2. Choose a statistical model \( f(x|\theta) \) for the data given the value of the parameter.
  3. After observing the data \( X_1, X_2, \ldots, X_n \), update our prior belief using Bayes' Theorem to obtain the posterior distribution \( f(\theta|X_1,X_2,\ldots,X_n) \).
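The three steps above can be sketched with the simplest conjugate pair, a Beta prior and a binomial model (the data values here are made up for illustration):

```r
# 1. prior: theta ~ Beta(a, b); a = b = 1 is the uniform prior
a <- 1; b <- 1

# 2. model: x successes in n trials, x | theta ~ Binomial(n, theta)
n <- 70; x <- 54   # assumed data, not from the slides

# 3. posterior: by conjugacy, theta | x ~ Beta(a + x, b + n - x)
post_a <- a + x; post_b <- b + n - x
post_a / (post_a + post_b)               # posterior mean, about 0.76
qbeta(c(0.025, 0.975), post_a, post_b)   # a 95% credible interval
```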